Enhanced antibody-antigen structure prediction from molecular docking using AlphaFold2

Predicting the structure of antibody-antigen complexes has tremendous value in biomedical research but unfortunately suffers from poor performance in real-life applications. AlphaFold2 (AF2) has provided renewed hope for improvements in the field of protein–protein docking but has shown limited success against antibody-antigen complexes due to the lack of co-evolutionary constraints. In this study, we used physics-based protein docking methods to build decoy sets consisting of low-energy docking solutions that were either geometrically close to the native structure (positives) or not (negatives). The docking models were then fed into AF2 to assess their confidence with a novel composite score based on normalized pLDDT and pTMscore metrics after AF2 structural refinement. We show benefits of the AF2 composite score for rescoring docking poses in terms of both (1) classification of positives/negatives and (2) success rates, with particular emphasis on early enrichment. Docking models of at least medium quality present in the decoy set, but not necessarily highly ranked by docking methods, benefitted most from AF2 rescoring, moving substantially towards the top of the reranked list of models. These improvements, obtained without any calibration or novel methodologies, led to a level of performance in antibody-antigen unbound docking that had not been achieved previously.


Figure S3. Success rates broken down for each model quality (see Methods) for the docking methods as a function of the number of top models considered from the ranked list. The rates are shown for models ranked using the docking scores and the AF2Composite score for the bound-backbone set. The rates were plotted on a logarithmic scale for better visibility. The success was evaluated using the docking-generated (left) and AF2-generated (right) models relative to the corresponding crystal structure.
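As a rough illustration of how a top-N success rate of this kind might be computed, the sketch below assumes that each system is represented by a list of CAPRI quality labels already sorted by score, best first. The function name `success_rate`, the "at least this quality" criterion and the toy data are assumptions made for illustration and are not taken from the study.

```python
# Hypothetical ordering of CAPRI quality labels, worst to best.
QUALITY_ORDER = {"incorrect": 0, "acceptable": 1, "medium": 2, "high": 3}


def success_rate(ranked_systems, top_n, min_quality="acceptable"):
    """Percentage of systems with at least one model of `min_quality`
    or better among their `top_n` best-ranked models."""
    threshold = QUALITY_ORDER[min_quality]
    hits = sum(
        any(QUALITY_ORDER[label] >= threshold for label in ranked[:top_n])
        for ranked in ranked_systems
    )
    return 100.0 * hits / len(ranked_systems)


# Toy data: four systems, three ranked models each (best-scored first).
systems = [
    ["incorrect", "medium", "incorrect"],
    ["incorrect", "incorrect", "acceptable"],
    ["incorrect", "incorrect", "incorrect"],
    ["high", "incorrect", "incorrect"],
]

for n in (1, 2, 3):
    print(n, success_rate(systems, n))
```

Calling the function once with the docking ranking and once with the rescored ranking of the same systems would give the two curves being compared.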

Figure S8. Influence of the scoring scheme on the success rates of ProPOSE and ZDOCK on the bound-backbone set. The scoring schemes are based exclusively on the standardized docking scores (Zscore), pLDDT (ZpLDDT) or pTMscore (ZpTMscore), on an additive combination of the pLDDT and pTMscore (ZpLDDTpTMscore), or on that combination weighted by deviations of the AF2-generated model from its docking-generated template (ZpLDDTpTMscoreCAPRI). The CAPRI metrics were used as a proxy to qualify the structural agreement between two structures. The weights were set at 0.25, 0.50, 0.75 and 1.00 for incorrect, acceptable, medium and high quality, respectively.

a N/C: not calculated due to bias in assessing AF2 performance based on structures present during its development phase; b N/A: not applicable; c N/M: not calculated due to the 5-model limit of AF2-Multimer.
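The scoring schemes named in the Figure S8 caption can be sketched as follows. This is a minimal illustration, assuming that "standardized" means a per-system z-score and that the CAPRI weight multiplies the summed z-scores; neither detail is confirmed by the caption. Only the four weight values come from the text, and the function names are hypothetical.

```python
from statistics import mean, pstdev

# Weights quoted in the caption, keyed by the CAPRI class describing how well
# the AF2-generated model agrees with its docking-generated template.
CAPRI_WEIGHTS = {"incorrect": 0.25, "acceptable": 0.50, "medium": 0.75, "high": 1.00}


def zscores(values):
    """Standardize values within one system (assumed: population z-score)."""
    values = list(values)
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def composite_scores(plddt, ptm, template_agreement=None):
    """Additive combination of standardized pLDDT and pTMscore.

    If `template_agreement` (one CAPRI label per model) is given, each sum is
    multiplied by the matching weight (assumed form of the weighting)."""
    z_sum = [a + b for a, b in zip(zscores(plddt), zscores(ptm))]
    if template_agreement is None:
        return z_sum
    return [z * CAPRI_WEIGHTS[c] for z, c in zip(z_sum, template_agreement)]


# Toy values for three models of one system.
plddt = [80.0, 90.0, 70.0]
ptm = [0.6, 0.8, 0.4]
print(composite_scores(plddt, ptm))
print(composite_scores(plddt, ptm, ["high", "medium", "incorrect"]))
```

With a purely multiplicative weight, a negative sum moves towards zero when down-weighted, so the study may apply the weights differently; the sketch only shows the general shape of such a scheme.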


Figure S1. Quality assessment of the AlphaFold2-generated models in relation to their provided docking-generated template structures. Distributions of the (A) fraction of conserved template contacts, (B) interface RMSD and (C) ligand RMSD between the AF2-generated model and its provided docking-generated template for the decoys in the unbound-backbone set. The RMSD calculations only include the Cα atoms. All decoys generated with ProPOSE, ZDOCK, PIPER and ClusPro were combined. The median of each distribution is indicated by the dashed white line, with values of 0.54, 1.27 and 3.22, respectively. (D) Transitions from the docking-generated models to the corresponding AF2-generated models in terms of model quality relative to the crystal structure. Transitions between the high-quality and incorrect classes were not observed. Colors denote structure quality levels as defined by CAPRI classification (see Methods): high (green), medium (yellow), acceptable (beige) and incorrect (grey).
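The three quantities plotted in panels A to C are the inputs normally used to assign a CAPRI quality class. The sketch below applies the commonly used CAPRI thresholds; the Methods section is not reproduced here, so it is an assumption that the study uses exactly these cut-offs. The function name `capri_quality` is hypothetical, and RMSD values are taken to be in ångströms.

```python
def capri_quality(fnat, i_rmsd, l_rmsd):
    """Assign a CAPRI-style quality class from the fraction of conserved
    contacts (fnat), interface RMSD and ligand RMSD.

    Thresholds are the commonly used CAPRI criteria (assumed here)."""
    if fnat >= 0.5 and (l_rmsd <= 1.0 or i_rmsd <= 1.0):
        return "high"
    if fnat >= 0.3 and (l_rmsd <= 5.0 or i_rmsd <= 2.0):
        return "medium"
    if fnat >= 0.1 and (l_rmsd <= 10.0 or i_rmsd <= 4.0):
        return "acceptable"
    return "incorrect"


# Illustrative input only: the three medians quoted in the caption come from
# separate distributions and need not describe any single model.
print(capri_quality(fnat=0.54, i_rmsd=1.27, l_rmsd=3.22))
```

Comparing the class of a docking-generated model with that of its AF2-generated counterpart, both computed against the crystal structure, would give one transition of the kind shown in panel D.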

Figure S4. Success rates broken down for each model quality (see Methods) for the docking methods as a function of the number of top models considered from the ranked list. The rates are shown for models ranked using the docking scores and the AF2Composite score for the unbound-backbone set. The rates were plotted on a logarithmic scale for better visibility. The success was evaluated using the docking-generated (left) and AF2-generated (right) models relative to the corresponding crystal structure.

Figure S5. Influence of joining the protein chains using a 50-residue-long artificial linker (Linker) or using a 200-residue-long indexing gap (No linker) on the success rates of ProPOSE and ZDOCK on the bound-backbone set.

Figure S6.

Figure S7. Skewness of the distribution of the normalized AF2Composite scores for the individual systems in the decoy sets (Decoys) in comparison to randomized, normally distributed data (Randomized).
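For readers who want to reproduce this kind of comparison, the sketch below computes a sample skewness and contrasts a deliberately asymmetric toy sample with normally distributed random data. The choice of estimator (the third standardized moment) and all data are assumptions; the caption does not state which skewness definition was used.

```python
import random
from statistics import mean, pstdev


def skewness(values):
    """Third standardized moment of a sample (population form)."""
    values = list(values)
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0
    return mean([((v - mu) / sigma) ** 3 for v in values])


rng = random.Random(0)  # fixed seed so the toy example is repeatable
randomized = [rng.gauss(0.0, 1.0) for _ in range(100)]

# Toy "decoy" scores: mostly low values with a few high-scoring outliers.
decoys = [rng.gauss(0.0, 1.0) for _ in range(95)] + [4.0, 4.5, 5.0, 5.5, 6.0]

print("randomized:", skewness(randomized))
print("decoys:", skewness(decoys))
```

A right-skewed score distribution, as in the toy decoy sample, is what one would expect when a few models stand out from a bulk of low-confidence ones.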

Figure S9. Correlation between the two components of the AF2Composite score. The scatterplots are shown separately for the four CAPRI model quality levels (see Methods): (A) high, (B) medium, (C) acceptable and (D) incorrect. All models generated by ProPOSE, ZDOCK, PIPER and ClusPro on the bound-backbone set are shown.

Figure S10. Smoothed density distribution of the relative AF2 confidence scores for the bound-backbone models in the negative (incorrect) and positive (acceptable-, medium- and high-quality) sets. The confidence scores are relative to those obtained by AlphaFold2 when provided with the crystal structures as input. The absolute (A) pTMscore and (B) pLDDT values are used as reference confidence scores for comparison because the composite score cannot be derived from a single structure, as is the case for the crystal. The confidence scores decrease as the quality of the model degrades.

Figure S11. Influence of providing a structural template to AF2 that has its side-chains truncated (Alanine) or that has a full atomistic representation of the side-chains with preservation of the original sequence (Sequence) on the success rates of ProPOSE and ZDOCK on the bound-backbone set. Only top-50 models were rescored by AF2.

Figure S12.

Figure S13. Error estimate on the AF2Composite score as a function of the number of docking-generated models per system, i.e. the ensemble size. The error is calculated by subtracting the AF2Composite scores of models collected from smaller samples (ensembles of size below 100) from the theoretical AF2Composite scores of the population (ensemble of size 100). The unbound-backbone ProPOSE-generated model set was used for plotting. As the size of the ensemble increases, the estimated values approach those of the population.
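The dependence of a normalized score on ensemble size can be illustrated with a small resampling experiment. The sketch below assumes the normalization is a z-score within the ensemble and reports the mean absolute difference between the subsample-based and population-based values; the actual AF2Composite normalization and error measure may differ. All names and data are hypothetical.

```python
import random
from statistics import mean, pstdev


def zscore_map(scores):
    """Map model id -> z-score within the given {id: raw score} ensemble."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    return {k: ((v - mu) / sigma if sigma else 0.0) for k, v in scores.items()}


def mean_abs_error(population, size, n_trials=200, seed=0):
    """Average |subsample z-score - population z-score| over random
    subsamples of `size` models drawn without replacement."""
    rng = random.Random(seed)
    reference = zscore_map(population)
    errors = []
    for _ in range(n_trials):
        keys = rng.sample(sorted(population), size)
        sample = zscore_map({k: population[k] for k in keys})
        errors.extend(abs(sample[k] - reference[k]) for k in keys)
    return mean(errors)


# Toy population: 100 models with random raw confidence scores.
rng = random.Random(1)
population = {i: rng.gauss(70.0, 10.0) for i in range(100)}

for size in (10, 25, 50, 100):
    print(size, mean_abs_error(population, size))
```

When the subsample contains all 100 models, the error vanishes by construction, which mirrors the convergence described in the caption.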

Figure S14. Runtimes per antibody-antigen model on the Compute Canada superclusters used for the AlphaFold2 calculations. The clusters comprise exclusively A100 (narval) or V100 (beluga) GPU units, or a combination of V100 and P100 GPU units (cedar and graham). The runtime is shown for AlphaFold2 using an explicit artificial linker (red) or by introducing a residue indexing gap in the sequence (blue).
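The two chain-joining strategies compared here, with the 50-residue linker and 200-residue indexing gap quoted for Figure S5, can be sketched in a few lines. This is an illustration of the idea only and does not call any AlphaFold2 code: the function names are hypothetical, the glycine linker composition is an assumption (the captions give only the linker length), and the gap is modeled as an offset in a plain list of residue indices.

```python
def join_with_linker(sequences, linker_length=50, linker_residue="G"):
    """Concatenate chains into one sequence with an explicit artificial linker.

    The linker composition (poly-glycine) is an assumption."""
    return (linker_residue * linker_length).join(sequences)


def residue_index_with_gaps(chain_lengths, gap=200):
    """Residue indices for concatenated chains, skipping `gap` index
    positions at every chain break instead of adding linker residues."""
    index, offset = [], 0
    for length in chain_lengths:
        index.extend(range(offset, offset + length))
        offset += length + gap
    return index


# Toy chains (placeholder sequences, not real antibody or antigen chains).
chains = ["AAA", "CC"]

print(len(join_with_linker(chains)))                   # 3 + 50 + 2 residues
print(residue_index_with_gaps([len(c) for c in chains]))
```

The linker variant makes the model process 50 extra residues per chain break, whereas the indexing gap leaves the sequence length unchanged, which is one plausible reason for a runtime difference between the two approaches.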